Extraction of Core Contents from Web Pages

نویسنده

  • Sandeep Sirsat
چکیده

The information available on web pages mostly contains semi-structured text documents which are represented either in XML, or HTML, or XHTML format that lacks formatted document structure. The document does not discriminate between the text and the schema that represent the text. Also the amount of structure used to represent the text depends on the purpose and size of text document. No semantic is applied to semi-structured documents. This requires extracting core contents of text document to analyse words or sentences to generate useful knowledge. This paper discusses several techniques and approaches useful for extracting core content from semi-structured text documents and their merits and demerits.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Visual Architecture based Web Information Extraction

ISSN 2250 – 107X | © 2011 Bonfring Abstract--The World Wide Web has more online web database which can be searched through their web query interface. Deep Web contents are accessed by queries submitted to Web databases and the returned data records are enwrapped in dynamically generated Web pages. Extracting structured data from deep Web pages is a challenging task due to the underlying complic...

متن کامل

Optimized Content Extraction from web pages using Composite Approaches

The information available today on web is tremendous and comes with greater challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problem in extracting the content from the web page is the newer architecture of web pages and the diversity in the structure of web pages. Optimized content extraction from HTML documents using collective approac...

متن کامل

Deep Web Data Extraction by Using Vision-Based Item and Data Extraction Algorithms

Deep Web contents are accessed by queries submitted to Web databases and the returned data records are enwrapped in dynamically generated Web pages (they will be called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem due to the underlying intricate structures of such pages. Until now, a large number of techniques have been proposed to addre...

متن کامل

An Efficient Image Based Approach for Extraction of Deep Web Data

The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Deep Web contents are extracted by submitting the queries to semi structured Web databases and the returned data records are enwrapped in dynamically generated Web pages. Extracting structured data from deep Web pages is a ch...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1403.1939  شماره 

صفحات  -

تاریخ انتشار 2014